Store Sales Time Series Forecasting from Kaggle
This study focuses on time-series forecasting of store
sales. The data comes from Kaggle and
is provided by an Ecuadorian company known as Corporación
Favorita.
I will be exploring the modeltime library to conduct
nested forecasting for the dataset provided.
There are a total of 54 stores and 33 product families in the data.
The time series runs from 01 Jan 2013 to 31 Aug 2017. The
data is split into training and test sets, and
the test set covers the 15 days after the last date in the
training data.
The dataset consists of 6 CSV files, described below:
train.csv contains the sales time series for each
store and product family combination.
test.csv contains the same features as the training
data.
stores.csv contains the metadata of all the stores
included in this analysis.
transactions.csv contains the number of transactions
recorded by each store throughout the period of analysis.
oil.csv contains the daily oil price, including
values for both the train and test time frames.
holidays_events.csv contains the holidays
and events that occurred throughout the period of analysis.
In addition to the data provided in the dataset, there are two points to be aware of:
Wages in the public sector are paid every two weeks, on the 15th and on the last day of the month. Supermarket sales could be affected by this.
A magnitude 7.8 earthquake struck Ecuador on April 16, 2016. People rallied in relief efforts, donating water and other first-need products, which greatly affected supermarket sales for several weeks after the earthquake.
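Both points can be turned into simple calendar flags for later modelling. The sketch below is one possible way to do it in base R. The helper names is_payday and is_post_earthquake are hypothetical, and the 28-day earthquake window is an illustrative assumption, since the data description only says "several weeks".

```r
# Hypothetical helper: flag public-sector paydays
# (the 15th and the last day of the month).
is_payday <- function(d) {
  d <- as.Date(d)
  day_of_month <- as.integer(format(d, "%d"))
  # A date is the last day of its month when the next day is the 1st.
  is_month_end <- as.integer(format(d + 1, "%d")) == 1L
  day_of_month == 15L | is_month_end
}

# Hypothetical helper: flag dates in an assumed window after the earthquake.
# window_days = 28 is an illustrative assumption, not a value from the data.
earthquake_date <- as.Date("2016-04-16")
is_post_earthquake <- function(d, window_days = 28) {
  d <- as.Date(d)
  d >= earthquake_date & d <= earthquake_date + window_days
}

example_dates <- as.Date(c("2016-02-15", "2016-02-28", "2016-02-29", "2016-04-20"))
payday_flags <- is_payday(example_dates)
quake_flags <- is_post_earthquake(example_dates)

print(data.frame(
  date = example_dates,
  payday = payday_flags,
  post_earthquake = quake_flags
))
```

Checking the day after each date avoids hard-coding month lengths, so leap years such as February 2016 are handled without special cases.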
These are the libraries used for the analysis.
pacman::p_load(tidyverse, tidymodels,
               timetk, modeltime, ggstatsplot, lubridate, trelliscopejs, seasonal,
               tsibble, feasts, fable, forecast, psych)
oil <- read_csv("rawdata/oil.csv")
holiday <- read_csv("rawdata/holidays_events.csv")
test <- read_csv("rawdata/test.csv")
train <- read_csv("rawdata/train.csv")
stores <- read_csv("rawdata/stores.csv")
transacation <- read_csv("rawdata/transactions.csv")
describe(oil)
           vars    n  mean    sd median trimmed  mad   min    max
date          1 1218   NaN    NA     NA     NaN   NA   Inf   -Inf
dcoilwtico    2 1175 67.71 25.63  53.19   66.96 18.7 26.19 110.62
           range skew kurtosis   se
date        -Inf   NA       NA   NA
dcoilwtico 84.43 0.32    -1.61 0.75
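In the describe() output above, n is 1218 for date but only 1175 for dcoilwtico, which implies 43 missing prices. Missing data can take two forms here: a row whose price is NA, and a calendar day with no row at all. The sketch below shows one way to count each, using a small made-up data frame rather than oil.csv itself.

```r
# Made-up stand-in for oil.csv: weekdays only, with one missing price.
toy_oil <- data.frame(
  date = as.Date(c("2017-08-10", "2017-08-11", "2017-08-14", "2017-08-15")),
  dcoilwtico = c(48.5, NA, 47.6, 47.5)
)

# Prices recorded as NA on dates that are present in the file.
n_missing_prices <- sum(is.na(toy_oil$dcoilwtico))

# Calendar days that have no row at all (here, the weekend).
all_days <- seq(min(toy_oil$date), max(toy_oil$date), by = "day")
absent_days <- all_days[!all_days %in% toy_oil$date]

print(n_missing_prices)
print(absent_days)
```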
oilplot <- ggplot(oil, aes(x = date, y = dcoilwtico)) +
  geom_line(colour = "#468499") +
  ylim(25, 115) +
  theme_classic() +
  labs(y = "Oil Price", x = "Date", title = "Daily Oil Price from 2013 - 2017", subtitle = "Without Linear Interpolation") +
  theme(axis.title.y = element_text(angle = 0))
oilplot
# Convert to a tsibble indexed by date, then interpolate the missing prices
ts_oil <- oil %>%
  as_tsibble(index = date)
ts_oil$dcoilwtico <- na.interp(ts_oil$dcoilwtico)
oilplot <- ggplot(ts_oil, aes(x = date, y = dcoilwtico)) +
  geom_line(colour = "#D2288A") +
  ylim(25, 115) +
  theme_classic() +
  labs(y = "Oil Price", x = "Date", title = "Daily Oil Price from 2013 - 2017", subtitle = "With Linear Interpolation") +
  theme(axis.title.y = element_text(angle = 0))
oilplot
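For a non-seasonal series, na.interp() fills gaps by linear interpolation between the neighbouring observed values. The sketch below reproduces that idea on made-up prices using base R approx() instead of na.interp(), so it illustrates the technique rather than the exact call used above.

```r
# Made-up prices with two consecutive gaps (not values from oil.csv).
price <- c(50, NA, NA, 56, 54)
idx <- seq_along(price)
observed <- !is.na(price)

# approx() draws a straight line between the observed neighbours
# and evaluates it at every position, including the gaps.
price_interp <- approx(x = idx[observed], y = price[observed], xout = idx)$y

print(price_interp)
```

The interpolation runs over row positions, so each gap is filled in equal steps between the last and next observed price.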
We assume that oil prices do not change over weekends or
holidays. Therefore, we add the missing calendar dates and use the fill() function to carry
the previous oil price forward into those dates.
ts_oil_fill <- ts_oil %>%
  complete(date = seq.Date(min(date), max(date), by = "day")) %>%
  fill(dcoilwtico)
ts_oilfill <- ggplot(ts_oil_fill, aes(x = date, y = dcoilwtico)) +
  geom_line(colour = "#D2288A") +
  ylim(25, 115) +
  theme_classic() +
  labs(y = "Oil Price", x = "Date", title = "Daily Oil Price from 2013 - 2017", subtitle = "With Linear Interpolation and Forward Fill") +
  theme(axis.title.y = element_text(angle = 0))
ts_oilfill
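To make the complete() and fill() step above concrete, here is a minimal sketch on a made-up two-row table covering a Friday and the following Monday. The prices are illustrative, not values from oil.csv.

```r
library(dplyr)
library(tidyr)

# Made-up weekday prices: a Friday and the following Monday.
toy_oil <- tibble(
  date = as.Date(c("2017-08-11", "2017-08-14")),
  dcoilwtico = c(48.8, 47.6)
)

toy_filled <- toy_oil %>%
  # Add one row per calendar day; the new weekend rows get NA prices.
  complete(date = seq.Date(min(date), max(date), by = "day")) %>%
  # fill() defaults to .direction = "down": carry the last observed price forward.
  fill(dcoilwtico)

print(toy_filled)
```

Saturday and Sunday inherit Friday's price, which matches the assumption that prices do not move while the market is closed.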
train_correlation <- train %>%
  group_by(date) %>%
  summarise(total_sales = sum(sales)) %>%
  ungroup()
train_oil <- train_correlation %>%
  left_join(ts_oil_fill, by = "date")
ggscatterstats(
  data = train_oil,
  x = total_sales,
  y = dcoilwtico,
  xlab = "Total Sales",
  ylab = "Daily Oil Price",
  title = "Checking for Correlation between Sales and Oil Prices",
  type = "np"
)
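The type = "np" argument above requests a nonparametric test, which ggscatterstats reports as a Spearman rank correlation. As a rough sketch of what that statistic measures, the same kind of coefficient can be computed with base R cor.test() on made-up numbers (these are not values from the Kaggle files).

```r
# Made-up daily totals and oil prices.
toy <- data.frame(
  total_sales = c(120, 150, 130, 180, 210, 160),
  dcoilwtico  = c(95, 90, 80, 60, 45, 70)
)

# Spearman's rho correlates the ranks of the two variables, so it captures
# any monotonic relationship and is less sensitive to outliers than Pearson's r.
spearman <- cor.test(toy$total_sales, toy$dcoilwtico, method = "spearman")
rho <- unname(spearman$estimate)

print(rho)
print(spearman$p.value)
```

A strongly negative rho here would mean that days with higher oil prices tend to have lower total sales, without assuming the relationship is linear.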
In this analysis, we examine whether the sales of each product family are associated with oil prices.
train_productfamily <- train %>%
  group_by(date, family) %>%
  summarise(total_sales = sum(sales)) %>%
  ungroup()
train_PF <- train_productfamily %>%
  left_join(ts_oil_fill, by = "date")
grouped_ggscatterstats(
  data = train_PF,
  x = total_sales,
  y = dcoilwtico,
  grouping.var = family,
  xlab = "Total Sales",
  ylab = "Daily Oil Price",
  type = "np",
  plotgrid.args = list(nrow = 8)
)
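With 33 product families, the grid of scatter plots can be hard to compare at a glance. One complementary option, sketched below under the assumption that a table like train_PF is available, is to compute the Spearman coefficient per family and sort the result. The toy_PF data and the family_cor name are made up for illustration.

```r
library(dplyr)

# Made-up stand-in for train_PF with two families.
toy_PF <- tibble(
  family      = rep(c("BEVERAGES", "AUTOMOTIVE"), each = 5),
  total_sales = c(100, 120, 140, 160, 180, 5, 7, 6, 9, 8),
  dcoilwtico  = c(90, 80, 70, 60, 50, 90, 80, 70, 60, 50)
)

family_cor <- toy_PF %>%
  group_by(family) %>%
  # One Spearman coefficient per product family.
  summarise(
    rho = cor(total_sales, dcoilwtico, method = "spearman"),
    .groups = "drop"
  ) %>%
  # Most negative correlations first.
  arrange(rho)

print(family_cor)
```

Sorting by rho puts the families whose sales move most strongly against oil prices at the top of the table.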